
gemm x86 support out_elemtype, multiheadattention and sdpa x86 support bf16 storage, skip mha bf16 tests #6623

Merged
nihui merged 27 commits into Tencent:master from nihui:sdpa-x86-bf16s
Mar 31, 2026

Conversation

@nihui (Member) commented Mar 30, 2026

No description provided.

@codecov-commenter commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.53%. Comparing base (18a7ad1) to head (4ed6121).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6623      +/-   ##
==========================================
+ Coverage   93.45%   93.53%   +0.08%     
==========================================
  Files         874      874              
  Lines      280098   281088     +990     
==========================================
+ Hits       261758   262921    +1163     
+ Misses      18340    18167     -173     

☔ View full report in Codecov by Sentry.

@tencent-adm (Member) commented Mar 30, 2026

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copilot AI (Contributor) left a comment

Pull request overview

This PR extends x86 compute paths to better support bf16 storage and Gemm output element type selection, and updates the test suite accordingly (including temporarily skipping MultiHeadAttention bf16 variants).

Changes:

  • Add output_elemtype handling to the x86 bf16 Gemm implementation so bf16 inputs can produce fp32 outputs.
  • Enable bf16 storage support flags for x86 MultiHeadAttention and SDPA, adjusting internal execution to accommodate bf16 storage.
  • Add a new Gemm test (test_gemm_5.cpp) and update test utilities to skip MultiHeadAttention bf16 testing.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/testutil.cpp Skips MultiHeadAttention bf16 tests; adds missing delete op on Vulkan skip paths (but early-return cleanup still incomplete).
tests/test_gemm_5.cpp New Gemm test covering output_elemtype=fp32 across shapes/transposes.
src/layer/x86/sdpa_x86.cpp Enables bf16 storage and updates intermediate/output allocations and memcpy sizes to respect bf16 elemsize.
src/layer/x86/multiheadattention_x86.cpp Enables bf16 storage; forces certain sublayers to fp32 and adds a bf16→fp32 cast for V before qkv gemm.
src/layer/x86/gemm_x86.cpp Threads output_elemtype through bf16 Gemm path and allocates/stores fp32 when requested.


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 10 comments.




@nihui (Member, Author) commented Mar 31, 2026

  MultiHeadAttention (MHA)

  ┌─────────┬──────────────────────────────┬────────────────────────────────┐
  │ Threads │ bf16 (no AVX512BF16) vs fp32 │ bf16 (with AVX512BF16) vs fp32 │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       1 │      0.94x (slightly slower) │                          1.37x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       2 │                        0.97x │                          1.31x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       4 │                        0.97x │                          1.24x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       8 │                        0.92x │                          1.14x │
  └─────────┴──────────────────────────────┴────────────────────────────────┘

  - bf16 with AVX512BF16 instructions shows a clear benefit on MHA, up to 1.37x speedup single-threaded
  - Small seqlen + large embed_dim benefits most (e.g. E=1024, S=128, 8 threads: fp32=622 → bf16+avx512bf16=791 GFLOPS)
  - bf16 without AVX512BF16 is actually slightly slower (~5-8%), because the bf16↔fp32 conversion overhead is not offset by any native bf16 compute acceleration

  SDPA

  ┌─────────┬──────────────────────────────┬────────────────────────────────┐
  │ Threads │ bf16 (no AVX512BF16) vs fp32 │ bf16 (with AVX512BF16) vs fp32 │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       1 │                        0.82x │                          0.99x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       2 │                        0.71x │                          0.83x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       4 │                        0.80x │                          0.93x │
  ├─────────┼──────────────────────────────┼────────────────────────────────┤
  │       8 │                        0.75x │                          0.85x │
  └─────────┴──────────────────────────────┴────────────────────────────────┘

  - SDPA gains far less from bf16 than MHA; in most scenarios the geometric mean is below fp32
  - Reason: SDPA does only the attention computation (QK^T + softmax + QKV), which is relatively light on compute and memory-access heavy, so bf16's compute advantage does not show here
  - Only at large seqlen (≥256) does bf16+AVX512BF16 begin to approach or slightly exceed fp32 (1.02x~1.13x)
  - Small seqlen (32~128) actually degrades badly under multithreading, likely because packing/conversion overhead dominates

  Key conclusions

  1. AVX512BF16 instructions significantly accelerate MHA (1.14x~1.37x), mainly benefiting the bf16 gemm of the four large Q/K/V/Out projections
  2. bf16 without AVX512BF16 brings essentially no positive benefit; it is recommended to enable bf16 only when the hardware supports AVX512BF16
  3. SDPA has limited headroom for bf16 optimization; its bottleneck is memory access, not compute

@nihui nihui merged commit 371bbad into Tencent:master Mar 31, 2026
106 of 109 checks passed